Metadensity: a background-aware python pipeline for summarizing CLIP signals on various transcriptomic sites

Abstract

Motivation: Cross-linking and immunoprecipitation (CLIP) is a technology to map the binding sites of RNA-binding proteins (RBPs). The region where an RBP binds within RNA is often indicative of its molecular function in RNA processing. As an example, the binding sites of splicing factors are found within or proximal to alternatively spliced exons. To better reveal the function of RBPs, we developed a tool to visualize the distribution of CLIP signals around various transcript features.

Results: Here, we present Metadensity (https://github.com/YeoLab/Metadensity), a software package that allows users to generate metagene plots. Metadensity allows users to input features such as branchpoints and preserves the near-nucleotide resolution of CLIP technologies by not scaling the features by length. Metadensity normalizes immunoprecipitated libraries against background controls, such as size-matched inputs, and then extracts windows around various user-defined features. Finally, the signals are averaged across a provided set of transcripts.

Availability and implementation: Metadensity is available at https://github.com/YeoLab/Metadensity, with example notebooks at https://metadensity.readthedocs.io/en/latest/tutorial.html.

Supplementary information: Supplementary data are available at Bioinformatics Advances online.

Enriched RBP binding at specific transcript features provides important clues to the function of the RBP. To illustrate, spliceosomal proteins are enriched at the 5′- and 3′-splice sites (ss) (Moore and Sharp, 1993), and RNA decay factors often interact within the 3′-untranslated regions (UTRs) of protein-coding genes (Muers, 2013). By examining the distribution of RBP-binding sites around canonical features in genes, one can infer the functions of RBPs.
The distribution of transcriptome-wide signals is often summarized in metagene plots. However, existing metagene packages (Olarerin-George and Jaffrey, 2017) emphasize the 5′-UTR-CDS-3′-UTR model on mature messenger RNAs (mRNAs). Such a model is useful in studying RNA stability and/or translational regulators, but many RBPs bind premature mRNAs to regulate splicing, polyadenylation and export (Hentze et al., 2018). To thoroughly comprehend an RBP's role in RNA processing, a software tool that includes multiple models of metagene density is needed. In addition, CLIP-seq data contain various background signals (Van Nostrand et al., 2016), and existing metagene packages do not support background normalization. The coverage at each position is strongly influenced by the expression level of the substrate. The use of a size-matched input (SMInput) library in eCLIP accounts for non-specific background signal in the identical size range on the membrane as well as any inherent biases in the ligation, reverse transcriptase-polymerase chain reaction, gel migration and transfer steps (Van Nostrand et al., 2016). Thus, when determining binding distributions, it is crucial to consider the background signal.
Here, we present Metadensity, a Python package that supports multiple types of metagene plots and allows user-customized feature creation. In addition, it has a built-in normalization procedure to account for background in the SMInput library. Finally, it allows the user not only to use the read coverage as an approximation of binding, but also to extract various diagnostic signals such as crosslink-induced truncations (CITs) and crosslink-induced mutations (CIMs).

Overview
Metadensity starts by extracting CLIP diagnostic signals from BAM/BIGWIG files for each transcript, using either the read coverage or the summation of CITs and CIMs. Alternatively, to speed up computation, a WIG track can be pre-computed (Fig. 1A), which allows us to accommodate other sequencing technologies that have signals and backgrounds in the format of BIGWIGs. The software package performs transcript-level normalization by calculating the relative information of the IP compared with the SMInput (Fig. 1A, middle) (Van Nostrand et al., 2020). For each nucleotide of the transcript, relative information content represents the transcript-level enrichment of IP signal over the background (SMInput). Specifically, this value encodes the relative entropy that reflects the contribution of each nucleotide (see Supplementary Methods). Lastly, users can define the length of a 'fixed window' to extend from the 5′- and 3′-boundaries of a transcriptomic feature. The relative information content values are extracted for each 'window' for further analysis or visualization (Fig. 1B). Metadensity outputs RBP maps (Fig. 1B), which contain the values for each individual transcript, or the mean/median across all transcripts (Fig. 1C).
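The three steps described above (transcript-level normalization, fixed-window extraction and averaging across transcripts) can be illustrated with a minimal NumPy sketch. This is not the Metadensity API: the function names, the pseudocount and the exact form of the relative information term, p_i * log2(p_i / q_i), are assumptions made for illustration, and the Supplementary Methods remain the authoritative definition.

```python
import numpy as np

def relative_information(ip, smi, pseudocount=1.0):
    """Per-nucleotide relative entropy term p_i * log2(p_i / q_i).

    Assumed form for illustration; the pseudocount is also an assumption.
    """
    ip = np.asarray(ip, dtype=float) + pseudocount
    smi = np.asarray(smi, dtype=float) + pseudocount
    p = ip / ip.sum()    # IP signal as a fraction of the transcript total
    q = smi / smi.sum()  # SMInput signal as a fraction of the transcript total
    return p * np.log2(p / q)

def fixed_window(values, boundary, upstream, downstream):
    """Unscaled window around a feature boundary, NaN-padded past transcript ends."""
    values = np.asarray(values, dtype=float)
    out = np.full(upstream + downstream, np.nan)
    start, end = boundary - upstream, boundary + downstream
    lo, hi = max(start, 0), min(end, len(values))
    if lo < hi:
        out[lo - start:hi - start] = values[lo:hi]
    return out

def mean_density(windows):
    """Average the windows position by position, ignoring NaN padding."""
    return np.nanmean(np.vstack(windows), axis=0)

# Toy example: two transcripts, each with IP signal piled up at the feature boundary.
ip_a, smi_a = [0, 0, 8, 0, 0, 0], [1, 1, 1, 1, 1, 1]
ip_b, smi_b = [4, 0, 0, 0], [1, 1, 1, 1]

win_a = fixed_window(relative_information(ip_a, smi_a), boundary=2, upstream=2, downstream=2)
win_b = fixed_window(relative_information(ip_b, smi_b), boundary=0, upstream=2, downstream=2)
density = mean_density([win_a, win_b])
print(density)
```

Because each window has a fixed length and is never rescaled, position 2 of every window corresponds to the same nucleotide offset from the boundary, which is what preserves near-nucleotide resolution when the windows are averaged. Dividing by the transcript total before comparing IP with SMInput is what removes the dependence on expression level in this sketch.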

Conclusion
Here, we provide a user-friendly package to generate various metagene plots for visualizing CLIP-seq data, including pre-mRNA features such as branchpoints and polyadenylation sites. The package takes outputs from the eCLIP pipeline, fetches diagnostic signals, performs background normalization and outputs RBP maps for transcriptome-wide eCLIP visualization. Users can utilize these visualizations to interrogate RBP functions. We showcase how the binding densities of the U2 and SF3B complexes align with current knowledge of their roles in the spliceosome; U2 proteins, for instance, show the strongest binding at the 3′-ss. The various metagene models will allow us to propose testable hypotheses about the impact of RBPs on various steps of RNA processing.

Funding
This work was supported by US National Institutes of Health research grants HG004659 and HG009889.